16-726 Learning-Based Image Synthesis (Spring 2025)

Assignment #5 - Cats Photo Editing

By: Lamia Alsalloom

Part 1: Inverting the Generator


In this section, we invert the generator by solving a nonconvex optimization problem in the latent space of a pretrained StyleGAN model using the w+ representation. The results below are obtained by applying different combinations of loss functions during inversion: an Lp (L1) loss that enforces pixel-level similarity, a perceptual loss that preserves high-level features from a pretrained network, and an L2 regularization term on the latent update (delta) that constrains the optimization. The outputs show how each combination affects reconstruction quality, highlighting the improvements in image detail and background fidelity obtained with the perceptual loss and regularization.
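
For reference, a minimal sketch of the inversion objective we optimize (not the exact assignment code; the generator, perceptual loss module, optimizer choice, and default weights shown here are illustrative assumptions):

    import torch

    def invert(generator, perc_loss, target, w_init, n_iters=1000,
               l1_w=10.0, perc_w=0.01, reg_w=0.001):
        # Optimize a perturbation (delta) of the initial w+ code so that the
        # synthesized image matches the target under the combined losses.
        delta = torch.zeros_like(w_init, requires_grad=True)
        opt = torch.optim.Adam([delta], lr=0.01)
        for _ in range(n_iters):
            opt.zero_grad()
            img = generator(w_init + delta)              # synthesize from w+ = w_init + delta
            loss = (l1_w * (img - target).abs().mean()   # Lp (L1) pixel loss
                    + perc_w * perc_loss(img, target)    # perceptual loss
                    + reg_w * delta.pow(2).mean())       # L2 regularization on delta
            loss.backward()
            opt.step()
        return (w_init + delta).detach()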

1. Various combinations of the losses, including Lp loss, perceptual loss, and/or a regularization loss that penalizes the L2 norm of delta:

Original

Original Image

conv_1_noise1

L1 pixel loss weight: 0

perc loss weight: 0.01

Regularization Loss Weight: 0.001

conv_1_noise2

L1 pixel loss weight: 0

perc loss weight: 0

Regularization Loss Weight: 0.001

conv_2_noise1

L1 pixel loss weight: 1

perc loss weight: 1

Regularization Loss Weight: 0.1

conv_3_noise1

L1 pixel loss weight: 10

perc loss weight: 1

Regularization Loss Weight: 0.1

conv_3_noise2

L1 pixel loss weight: 10

perc loss weight: 0

Regularization Loss Weight: 0

conv_4_noise1

L1 pixel loss weight: 0

perc loss weight: 0

Regularization Loss Weight: 0.1

conv_4_noise2

L1 pixel loss weight: 0.1

perc loss weight: 0.01

Regularization Loss Weight: 0.001





2. Different generative models, including a vanilla GAN and StyleGAN:

The following results are obtained by optimizing a combination of L1 and perceptual losses over 1000 iterations in the z latent space. All experiments are run on an A100 GPU. Our experiments demonstrate that although the vanilla GAN is computationally efficient (~40 s) due to its streamlined architecture, its reconstructions are significantly less detailed and realistic than the higher-fidelity outputs generated by StyleGAN (~47 s).

Original

Original Image

Vanilla GAN

Vanilla GAN

StyleGAN

StyleGAN



3. Different latent spaces (latent code in z space, w space, and w+ space):

The following results are obtained by using different latent spaces with the StyleGAN model, while optimizing a combination of L1 and perceptual losses over 1000 iterations. All experiments are run on an A100 GPU, each taking roughly 47 s. A minimal sketch of how each latent code is formed follows the table below.

Latent Space
Example 1
Example 2
original image
Original Example 1
Original Example 2
z space
z space Example 1
z space Example 2
w space
w space Example 1
w space Example 2
w+ space
w+ space Example 1
w+ space Example 2
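
For concreteness, a minimal sketch of how each latent code is formed (the mapping network below is a small stand-in and num_ws is an assumed value; this is not the actual StyleGAN code):

    import torch
    import torch.nn as nn

    num_ws = 14                                      # number of per-layer style inputs (resolution dependent; assumed)
    mapping = nn.Sequential(                         # stand-in for StyleGAN's mapping MLP
        nn.Linear(512, 512), nn.LeakyReLU(0.2),
        nn.Linear(512, 512))

    z = torch.randn(1, 512)                          # z space: sampled from a standard normal prior
    w = mapping(z)                                   # w space: z pushed through the mapping network
    w_plus = w.unsqueeze(1).repeat(1, num_ws, 1)     # w+ space: an independent w per synthesis layer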

4. Comments on why the various outputs look the way they do, which combination gives the best result, and how fast the method performs:

- When using the z latent space for StyleGAN, the optimization becomes particularly challenging because gradients must pass through the additional mapping network, often leading to reconstructions that remain very close to the initial state rather than converging to the input.

- While both the w and w+ latent spaces improve reconstruction quality, the w+ space delivers greater detail and background fidelity owing to its increased expressiveness and flexibility.

- Experiments show that vanilla GAN inversions take about 40 seconds per image, while StyleGAN-based inversions require roughly 47 seconds per image.

- The results indicate that GAN outputs tend to be unstable and that using perceptual loss is essential for generating images that closely match the reference image, whereas regularizing the latent update (delta) has little effect.

- The best performance in our experiments is achieved using StyleGAN with either the w or w+ latent spaces, as these methods capture the input image more accurately than using the z space, and StyleGAN outperforms vanilla GAN for inversion tasks.

- The best result is obtained using StyleGAN with the w+ space, an Lp loss weight of 10, a perceptual loss weight of 0.01, and a regularization loss weight of 0.0. The method completes 1000 iterations in about 47 s.

Part 2: Scribble to Image


Sketch Image
Mask Image
Image
Sketch 1
Mask 1
Image 1
Sketch 2
Mask 2
Image 2
Sketch 3
Mask 3
Image 3
Sketch 4
Mask 4
Image 4
Sketch 5
Mask 5
Image 5
Sketch 6
Mask 6
Image 6
Sketch 7
Mask 7
Image 7
Sketch 8
Mask 8
Image 8
Sketch 9
Mask 9
Image 9
Sketch 10

I drew this cat

Mask 10
Image 10
Sketch 11

I drew this cat

Mask 11
Image 11
Using StyleGAN with the w+ latent space and optimizing with L1 and perceptual losses over 1000 iterations, most outputs accurately reflect the mask and capture essential details. However, in some cases the generated image remains too similar to the original training sample or introduces unrealistic features to satisfy the sketch constraints (7th and 9th rows). The color of the sketch itself also plays a critical role, as atypical hues or backgrounds can lead to less convincing results (last two rows).
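
For reference, a minimal sketch of the masked objective used for scribble-to-image, assuming the sketch, binary mask, and generated image share the same resolution and a perceptual loss callable is available (names are illustrative, not the exact assignment code):

    import torch.nn.functional as F

    def scribble_loss(generated, sketch, mask, perc_loss, l1_w=10.0, perc_w=0.01):
        # Only the scribbled pixels (mask == 1) constrain the optimization;
        # everywhere else the image is left to the generator's prior.
        masked_gen = generated * mask
        masked_sketch = sketch * mask
        loss = l1_w * F.l1_loss(masked_gen, masked_sketch)
        loss = loss + perc_w * perc_loss(masked_gen, masked_sketch)
        return loss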

Part 3: Stable Diffusion


Show some example outputs of your guided image synthesis on at least 2 different input images:

Input Image
Prompt
Guidance Strength
Steps
Output
Sketch 1

Grumpy cat reimagined as a royal painting

15
700
Image 1
Sketch 2

Grumpy cat reimagined as a royal painting

15
700
Image 2


Show some example outputs of your guided image synthesis with 2 different amounts of noise added to the input:

Input Image
Prompt
Guidance Strength
Steps
Output
Sketch 1

A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style

15
500
Image 1
Sketch 1

A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style

15
700
Image 2
Sketch 1

A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style

15
1000
Image 3
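
The number of steps controls how much noise is injected into the input before denoising. Below is a minimal sketch of the forward-noising step, assuming a precomputed alphas_cumprod schedule (names are illustrative, not the exact assignment code):

    import torch

    def noise_input(x0, t, alphas_cumprod):
        # DDPM forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps.
        # A larger t (more noising steps) erases more of the input structure.
        a_bar = alphas_cumprod[t]
        eps = torch.randn_like(x0)
        return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps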


Show some example outputs of your guided image synthesis on 2 different classifier-free guidance strength values:
For this part, we fix steps to 700 and use the same prompt on multiple sketch images with varying values of guidance strength.
Prompt: A regal oil painting of a grumpy looking cat, highly detailed, realistic fur texture, dramatic lighting, vintage art style.

Input Image
Strength 3.5
Strength 6.0
Strength 15
Sketch 1
Strength 3.5 Image 1
Strength 6.0 Image 1
Strength 15 Image 1
Sketch 2
Strength 3.5 Image 2
Strength 6.0 Image 2
Strength 15 Image 2
Sketch 3
Strength 3.5 Image 3
Strength 6.0 Image 3
Strength 15 Image 3
Sketch 4
Strength 3.5 Image 4
Strength 6.0 Image 4
Strength 15 Image 4
Sketch 5
Strength 3.5 Image 5
Strength 6.0 Image 5
Strength 15 Image 5
Sketch 6
Strength 3.5 Image 6
Strength 6.0 Image 6
Strength 15 Image 6
Sketch 7
Strength 3.5 Image 7
Strength 6.0 Image 7
Strength 15 Image 7
Sketch 8
Strength 3.5 Image 8
Strength 6.0 Image 8
Strength 15 Image 8
Sketch 9
Strength 3.5 Image 9
Strength 6.0 Image 9
Strength 15 Image 9
Sketch 10
Strength 3.5 Image 10
Strength 6.0 Image 10
Strength 15 Image 10
Sketch 11
Strength 3.5 Image 11
Strength 6.0 Image 11
Strength 15 Image 11
Sketch 12
Strength 3.5 Image 12
Strength 6.0 Image 12
Strength 15 Image 12
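
For reference, a minimal sketch of how the classifier-free guidance strength combines the conditional and unconditional noise predictions (assuming a diffusers-style UNet; the call signature is an assumption, not the exact assignment code):

    import torch

    def cfg_noise_pred(unet, x_t, t, text_emb, uncond_emb, guidance_scale):
        # Classifier-free guidance: extrapolate from the unconditional prediction
        # toward the text-conditioned prediction by `guidance_scale`.
        eps_uncond = unet(x_t, t, encoder_hidden_states=uncond_emb).sample
        eps_text = unet(x_t, t, encoder_hidden_states=text_emb).sample
        return eps_uncond + guidance_scale * (eps_text - eps_uncond)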

Bells & Whistles (Extra Points)


Interpolate between two latent codes in the GAN model, and generate an image sequence (2pt):

Image 1
Image 2
Image 1
Image 2
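
A minimal interpolation sketch, assuming two already-inverted latent codes w1 and w2 and a generator G (names are illustrative):

    import torch

    def interpolate_latents(G, w1, w2, n_frames=10):
        # Linearly interpolate between two latent codes and decode each step
        # to produce the image sequence.
        frames = []
        for alpha in torch.linspace(0.0, 1.0, n_frames):
            w = (1.0 - alpha) * w1 + alpha * w2
            frames.append(G(w))
        return frames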

Implement additional types of constraints. (3pts each): e.g., sketch/shape constraint and warping constraints mentioned in the iGAN paper, or texture constraint using a style loss.:
(1) Sketch/shape constraint: We implemented an extra edge constraint based on a Sobel operator. This sketch/shape constraint computes edge maps from both the generated image and the reference sketch, then applies an L1 loss between the two edge maps. It encourages the generated image to retain the structural details, such as outlines and edges, that are present in the input sketch.


0_data

0_data.png

0_mask

0_mask.png

0_250

0_250.png

0_500

0_500.png

0_750

0_750.png

0_1000

0_1000.png

The additional edge constraint nudges the generator to align with the edges of the scribble, preserving crucial contours and outlines in the final image so that it better reflects the structure of the input sketch.

We place a call in the Criterion and set the weight for the edge-constraint loss to 1:

    import torch
    import torch.nn.functional as F

    def compute_sobel_edges(img):
        """
        Compute an approximate edge map using the Sobel operator.
        img: Tensor of shape (B, C, H, W) with values in [0, 1].
        Returns a tensor of shape (B, 1, H, W) of edge magnitudes.
        """
        sobel_x = torch.tensor([[-1., 0., 1.],
                                [-2., 0., 2.],
                                [-1., 0., 1.]], device=img.device).view(1, 1, 3, 3)
        sobel_y = torch.tensor([[-1., -2., -1.],
                                [ 0.,  0.,  0.],
                                [ 1.,  2.,  1.]], device=img.device).view(1, 1, 3, 3)
        # Average the channels to get a single-channel grayscale image.
        gray = img.mean(dim=1, keepdim=True)
        edge_x = F.conv2d(gray, sobel_x, padding=1)
        edge_y = F.conv2d(gray, sobel_y, padding=1)
        edges = torch.sqrt(edge_x ** 2 + edge_y ** 2)
        return edges

    def edge_loss(generated, sketch):
        """
        Compute an L1 loss between the edge maps of the generated image and the input sketch.
        generated, sketch: Tensors of shape (B, C, H, W) with values in [0, 1].
        """
        gen_edges = compute_sobel_edges(generated)
        sketch_edges = compute_sobel_edges(sketch)
        return F.l1_loss(gen_edges, sketch_edges)
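
A hypothetical wiring inside the criterion (the surrounding term names below are illustrative, not the exact assignment code; the edge weight of 1 follows the text above):

    # Illustrative: combine with the existing reconstruction terms.
    total_loss = l1_term + perc_term + 1.0 * edge_loss(generated, sketch)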
      


(2) Texture loss: The texture loss is a style loss computed via Gram matrices of features extracted from a pretrained network, used to capture and enforce similar texture patterns between the generated image and a reference texture image. It encourages the generated image to adopt the local texture statistics, such as color distributions and patterns, of the provided texture reference, improving the overall style consistency of the output.


4_data

4_data.png

4_mask

4_mask.png

4_250

4_250.png

4_500

4_500.png

4_1000

4_1000.png

Reference texture image

By applying a high texture weight (1.0) alongside minimal edge and shape weights, the generated images strongly incorporate patterns from the zebra-like reference texture, causing the cat’s fur and background to adopt striping and bold textural elements. With lower edge/shape constraints, the sketch outlines have less influence on the structure, allowing the texture to dominate the final style.

We place a call in the Criterion and set the weight for the texture loss to 1.0:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import vgg19

    class TextureLoss(nn.Module):
        def __init__(self, layers=(3, 8, 17, 26)):  # example VGG19 layer indices
            super(TextureLoss, self).__init__()
            # Use a frozen, pretrained VGG19 feature extractor (registered as a
            # submodule, so moving the criterion with .to(device) also moves it).
            self.vgg = vgg19(pretrained=True).features.eval()
            for param in self.vgg.parameters():
                param.requires_grad = False
            self.layers = set(layers)

        def gram_matrix(self, features):
            # The Gram matrix captures channel-wise feature correlations (texture statistics).
            B, C, H, W = features.size()
            features = features.view(B, C, H * W)
            gram = torch.bmm(features, features.transpose(1, 2))
            return gram / (C * H * W)

        def forward(self, generated, texture):
            loss = 0.0
            x, y = generated, texture
            for i, layer in enumerate(self.vgg):
                x = layer(x)
                y = layer(y)
                if i in self.layers:
                    # Match the Gram matrices of the generated image and the texture reference.
                    loss = loss + F.mse_loss(self.gram_matrix(x), self.gram_matrix(y))
            return loss
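
A hypothetical usage, assuming a reference texture tensor texture_img of shape (B, 3, H, W) (names are illustrative, not the exact assignment code):

    # Illustrative: add the style term to the criterion with weight 1.0.
    texture_criterion = TextureLoss()
    total_loss = total_loss + 1.0 * texture_criterion(generated, texture_img)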